{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "1dcb4058",
      "metadata": {
        "id": "1dcb4058"
      },
      "source": [
        "## 2. Install Dependencies and Import Libraries"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a755cc2a",
      "metadata": {
        "id": "a755cc2a"
      },
      "source": [
        "Run the cell below to install Git LFS, which we use to download our dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1891cad9",
      "metadata": {
        "id": "1891cad9"
      },
      "outputs": [],
      "source": [
        "!git lfs install"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c4899e7a-43ef-4ae7-8f12-0024037a0b43",
      "metadata": {
        "id": "c4899e7a-43ef-4ae7-8f12-0024037a0b43"
      },
      "source": [
        "Install and import Python dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f2d18e80",
      "metadata": {
        "id": "f2d18e80"
      },
      "outputs": [],
      "source": [
        "!pip install \"ragas==0.1.4\" pypdf \"arize-phoenix[llama-index]\" \"openai>=1.0.0\" pandas"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "02304338",
      "metadata": {
        "id": "02304338"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Display the complete contents of dataframe cells.\n",
        "pd.set_option(\"display.max_colwidth\", None)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a6a8385c",
      "metadata": {
        "id": "a6a8385c"
      },
      "source": [
        "## 3. Setup\n",
        "\n",
        "Set your OpenAI API key if it is not already set as an environment variable."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "534f85a3",
      "metadata": {
        "id": "534f85a3"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
        "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
        "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f31e2923",
      "metadata": {
        "id": "f31e2923"
      },
      "source": [
        "Launch Phoenix in the background and setup auto-instrumentation for llama-index and LangChain so that your OpenInference spans and traces are sent to and collected by Phoenix. [OpenInference](https://github.com/Arize-ai/openinference/tree/main/spec) is an open standard built atop OpenTelemetry that captures and stores LLM application executions. It is designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context, such as retrieval from vector stores and the usage of external tools such as search engines or APIs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9373c880",
      "metadata": {
        "id": "9373c880"
      },
      "outputs": [],
      "source": [
        "import phoenix as px\n",
        "\n",
        "(session := px.launch_app()).view()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3e3c8b98",
      "metadata": {
        "id": "3e3c8b98"
      },
      "outputs": [],
      "source": [
        "from openinference.instrumentation.langchain import LangChainInstrumentor\n",
        "from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n",
        "from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter\n",
        "from opentelemetry.sdk.trace import SpanLimits, TracerProvider\n",
        "from opentelemetry.sdk.trace.export import SimpleSpanProcessor\n",
        "\n",
        "endpoint = \"http://127.0.0.1:6006/v1/traces\"\n",
        "tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))\n",
        "tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))\n",
        "\n",
        "LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)\n",
        "LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "78f707d3-e921-4f81-bbfb-a2ddb917c79d",
      "metadata": {
        "id": "78f707d3-e921-4f81-bbfb-a2ddb917c79d"
      },
      "source": [
        "## 4. Generate Your Synthetic Test Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3d52a38d",
      "metadata": {
        "id": "3d52a38d"
      },
      "source": [
        "Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1dd4ce7f",
      "metadata": {
        "id": "1dd4ce7f"
      },
      "source": [
        "Run the cell below to download a dataset of prompt engineering papers in PDF format from arXiv and read these documents using LlamaIndex."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "548a0aba-a055-4262-8bd2-ee9e11cfd3b9",
      "metadata": {
        "id": "548a0aba-a055-4262-8bd2-ee9e11cfd3b9"
      },
      "outputs": [],
      "source": [
        "!git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-papers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ea5e2125-3d3a-4a09-b307-24ab443087d3",
      "metadata": {
        "id": "ea5e2125-3d3a-4a09-b307-24ab443087d3"
      },
      "outputs": [],
      "source": [
        "from llama_index.core import SimpleDirectoryReader\n",
        "\n",
        "dir_path = \"./prompt-engineering-papers\"\n",
        "reader = SimpleDirectoryReader(dir_path, num_files_limit=2)\n",
        "documents = reader.load_data()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0909a561",
      "metadata": {
        "id": "0909a561"
      },
      "source": [
        "An ideal test dataset should contain data points of high quality and diverse nature from a similar distribution to the one observed during production. Ragas uses a unique evolution-based synthetic data generation paradigm to generate questions that are of the highest quality which also ensures diversity of questions generated.  Ragas by default uses OpenAI models under the hood, but you’re free to use any model of your choice. Let’s generate 100 data points using Ragas."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b4d7e1d0-4c6e-4fd8-bfb8-be7b42d3de1e",
      "metadata": {
        "id": "b4d7e1d0-4c6e-4fd8-bfb8-be7b42d3de1e"
      },
      "outputs": [],
      "source": [
        "from phoenix.trace import using_project\n",
        "from ragas.testset.evolutions import multi_context, reasoning, simple\n",
        "from ragas.testset.generator import TestsetGenerator\n",
        "\n",
        "TEST_SIZE = 5\n",
        "\n",
        "# generator with openai models\n",
        "generator = TestsetGenerator.with_openai()\n",
        "\n",
        "# set question type distribution\n",
        "distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}\n",
        "\n",
        "# generate testset\n",
        "with using_project(\"ragas-testset\"):\n",
        "    testset = generator.generate_with_llamaindex_docs(\n",
        "        documents, test_size=TEST_SIZE, distributions=distribution\n",
        "    )\n",
        "test_df = (\n",
        "    testset.to_pandas().sort_values(\"question\").drop_duplicates(subset=[\"question\"], keep=\"first\")\n",
        ")\n",
        "test_df.head(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9bb9ffac",
      "metadata": {
        "id": "9bb9ffac"
      },
      "source": [
        "You are free to change the question type distribution according to your needs. Since we now have our test dataset ready, let’s move on and build a simple RAG pipeline using LlamaIndex."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ded50764-cd14-402b-93fd-0e8377b88ddd",
      "metadata": {
        "id": "ded50764-cd14-402b-93fd-0e8377b88ddd"
      },
      "source": [
        "## 5. Build Your RAG Application With LlamaIndex"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ff9c7460",
      "metadata": {
        "id": "ff9c7460"
      },
      "source": [
        "LlamaIndex is an easy to use and flexible framework for building RAG applications. For the sake of simplicity, we use the default LLM (gpt-3.5-turbo) and embedding models (openai-ada-2)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e1eba224",
      "metadata": {
        "id": "e1eba224"
      },
      "outputs": [],
      "source": [
        "from llama_index.core import VectorStoreIndex\n",
        "from llama_index.embeddings.openai import OpenAIEmbedding\n",
        "from phoenix.trace import using_project\n",
        "\n",
        "\n",
        "def build_query_engine(documents):\n",
        "    vector_index = VectorStoreIndex.from_documents(\n",
        "        documents,\n",
        "        embed_model=OpenAIEmbedding(),\n",
        "    )\n",
        "    query_engine = vector_index.as_query_engine(similarity_top_k=2)\n",
        "    return query_engine\n",
        "\n",
        "\n",
        "with using_project(\"indexing\"):\n",
        "    # By assigning a project name, the instrumentation will send all the embeddings to the indexing project\n",
        "    query_engine = build_query_engine(documents)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6b3a10b4",
      "metadata": {
        "id": "6b3a10b4"
      },
      "source": [
        "If you check Phoenix, you should see embedding spans from when your corpus data was indexed. Export and save those embeddings into a dataframe for visualization later in the notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3b4c26fb",
      "metadata": {
        "id": "3b4c26fb"
      },
      "outputs": [],
      "source": [
        "from time import sleep\n",
        "\n",
        "# Wait a little bit in case data hasn't beomme fully available\n",
        "sleep(2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c5c6e3bc",
      "metadata": {
        "id": "c5c6e3bc"
      },
      "outputs": [],
      "source": [
        "from phoenix.trace.dsl.helpers import SpanQuery\n",
        "\n",
        "client = px.Client()\n",
        "corpus_df = px.Client().query_spans(\n",
        "    SpanQuery().explode(\n",
        "        \"embedding.embeddings\",\n",
        "        text=\"embedding.text\",\n",
        "        vector=\"embedding.vector\",\n",
        "    ),\n",
        "    project_name=\"indexing\",\n",
        ")\n",
        "corpus_df.head(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "59e745b4",
      "metadata": {
        "id": "59e745b4"
      },
      "source": [
        "## 6. Evaluate Your LLM Application"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "df6acfc5",
      "metadata": {
        "id": "df6acfc5"
      },
      "source": [
        "Ragas provides a comprehensive list of metrics that can be used to evaluate RAG pipelines both component-wise and end-to-end.\n",
        "\n",
        "To use Ragas, we first form an evaluation dataset comprised of a question, generated answer, retrieved context, and ground-truth answer (the actual expected answer for the given question)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e2597314-d6de-412d-b00c-3e00297746e2",
      "metadata": {
        "id": "e2597314-d6de-412d-b00c-3e00297746e2"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "from datasets import Dataset\n",
        "from phoenix.trace import using_project\n",
        "from tqdm.auto import tqdm\n",
        "\n",
        "\n",
        "def generate_response(query_engine, question):\n",
        "    response = query_engine.query(question)\n",
        "    return {\n",
        "        \"answer\": response.response,\n",
        "        \"contexts\": [c.node.get_content() for c in response.source_nodes],\n",
        "    }\n",
        "\n",
        "\n",
        "def generate_ragas_dataset(query_engine, test_df):\n",
        "    test_questions = test_df[\"question\"].values\n",
        "    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]\n",
        "\n",
        "    dataset_dict = {\n",
        "        \"question\": test_questions,\n",
        "        \"answer\": [response[\"answer\"] for response in responses],\n",
        "        \"contexts\": [response[\"contexts\"] for response in responses],\n",
        "        \"ground_truth\": test_df[\"ground_truth\"].values.tolist(),\n",
        "    }\n",
        "    ds = Dataset.from_dict(dataset_dict)\n",
        "    return ds\n",
        "\n",
        "\n",
        "with using_project(\"llama-index\"):\n",
        "    ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)\n",
        "\n",
        "ragas_evals_df = pd.DataFrame(ragas_eval_dataset)\n",
        "ragas_evals_df.head(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "87117e89",
      "metadata": {
        "id": "87117e89"
      },
      "source": [
        "Check out Phoenix to view your LlamaIndex application traces."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8f0d6aea",
      "metadata": {
        "id": "8f0d6aea"
      },
      "outputs": [],
      "source": [
        "print(session.url)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2a671393",
      "metadata": {
        "id": "2a671393"
      },
      "source": [
        "![LlamaIndex application traces inside of Phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/ragas/ragas_trace_slide_over.gif)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c843f75d",
      "metadata": {
        "id": "c843f75d"
      },
      "source": [
        "We save out a couple of dataframes, one containing embedding data that we'll visualize later, and another containing our exported traces and spans that we plan to evaluate using Ragas."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8faaa5a4",
      "metadata": {
        "id": "8faaa5a4"
      },
      "outputs": [],
      "source": [
        "from time import sleep\n",
        "\n",
        "# Wait a bit in case data hasn't beomme fully available\n",
        "sleep(2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2098cd28",
      "metadata": {
        "id": "2098cd28"
      },
      "outputs": [],
      "source": [
        "# dataset containing embeddings for visualization\n",
        "query_embeddings_df = px.Client().query_spans(\n",
        "    SpanQuery().explode(\"embedding.embeddings\", text=\"embedding.text\", vector=\"embedding.vector\"),\n",
        "    project_name=\"llama-index\",\n",
        ")\n",
        "query_embeddings_df.head(2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d9b6ba24",
      "metadata": {
        "id": "d9b6ba24"
      },
      "outputs": [],
      "source": [
        "from phoenix.session.evaluation import get_qa_with_reference\n",
        "\n",
        "# dataset containing span data for evaluation with Ragas\n",
        "spans_dataframe = get_qa_with_reference(client, project_name=\"llama-index\")\n",
        "spans_dataframe.head(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a6b96c87",
      "metadata": {
        "id": "a6b96c87"
      },
      "source": [
        "Ragas uses LangChain to evaluate your LLM application data. Since we initialized the LangChain instrumentation above we can see what's going on under the hood when we evaluate our LLM application."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bfc94272",
      "metadata": {
        "id": "bfc94272"
      },
      "source": [
        "Evaluate your LLM traces and view the evaluation scores in dataframe format."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a9feae80",
      "metadata": {
        "id": "a9feae80"
      },
      "source": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cc5bf278-b3ea-4e2a-9653-f724f41c067e",
      "metadata": {
        "id": "cc5bf278-b3ea-4e2a-9653-f724f41c067e"
      },
      "outputs": [],
      "source": [
        "from phoenix.trace import using_project\n",
        "from ragas import evaluate\n",
        "from ragas.metrics import (\n",
        "    answer_correctness,\n",
        "    context_precision,\n",
        "    context_recall,\n",
        "    faithfulness,\n",
        ")\n",
        "\n",
        "# Log the traces to the project \"ragas-evals\" just to view\n",
        "# how Ragas works under the hood\n",
        "with using_project(\"ragas-evals\"):\n",
        "    evaluation_result = evaluate(\n",
        "        dataset=ragas_eval_dataset,\n",
        "        metrics=[faithfulness, answer_correctness, context_recall, context_precision],\n",
        "    )\n",
        "eval_scores_df = pd.DataFrame(evaluation_result.scores)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "4eae5015",
      "metadata": {
        "id": "4eae5015"
      },
      "source": [
        "Submit your evaluations to Phoenix so they are visible as annotations on your spans."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1610a987",
      "metadata": {
        "id": "1610a987"
      },
      "outputs": [],
      "source": [
        "# Assign span ids to your ragas evaluation scores (needed so Phoenix knows where to attach the spans).\n",
        "span_questions = (\n",
        "    spans_dataframe[[\"input\"]]\n",
        "    .sort_values(\"input\")\n",
        "    .drop_duplicates(subset=[\"input\"], keep=\"first\")\n",
        "    .reset_index()\n",
        "    .rename({\"input\": \"question\"}, axis=1)\n",
        ")\n",
        "ragas_evals_df = ragas_evals_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
        "test_df = test_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
        "eval_data_df = pd.DataFrame(evaluation_result.dataset)\n",
        "eval_data_df = eval_data_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
        "eval_scores_df.index = eval_data_df.index\n",
        "\n",
        "query_embeddings_df = (\n",
        "    query_embeddings_df.sort_values(\"text\")\n",
        "    .drop_duplicates(subset=[\"text\"])\n",
        "    .rename({\"text\": \"question\"}, axis=1)\n",
        "    .merge(span_questions, on=\"question\")\n",
        "    .set_index(\"context.span_id\")\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "feb81616",
      "metadata": {
        "id": "feb81616"
      },
      "outputs": [],
      "source": [
        "from phoenix.trace import SpanEvaluations\n",
        "\n",
        "# Log the evaluations to Phoenix under the project \"llama-index\"\n",
        "# This will allow you to visualize the scores alongside the spans in the UI\n",
        "for eval_name in eval_scores_df.columns:\n",
        "    evals_df = eval_scores_df[[eval_name]].rename(columns={eval_name: \"score\"})\n",
        "    evals = SpanEvaluations(eval_name, evals_df)\n",
        "    px.Client().log_evaluations(evals)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e16699fd",
      "metadata": {
        "id": "e16699fd"
      },
      "source": [
        "If you check out Phoenix, you'll see your Ragas evaluations as annotations on your application spans."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a7c25cfa",
      "metadata": {
        "id": "a7c25cfa"
      },
      "outputs": [],
      "source": [
        "print(session.url)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "95f44224",
      "metadata": {
        "id": "95f44224"
      },
      "source": [
        "![ragas evaluations appear as annotations on your spans](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/ragas/ragas_evaluation_annotations.gif)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1c74e381",
      "metadata": {
        "id": "1c74e381"
      },
      "source": [
        "## 8. Recap\n",
        "\n",
        "Congrats! You built and evaluated a LlamaIndex query engine using Ragas and Phoenix. Let's recap what we learned:\n",
        "\n",
        "- With Ragas, you bootstraped a test dataset and computed metrics such as faithfulness and answer correctness to evaluate your LlamaIndex query engine.\n",
        "- With OpenInference, you instrumented your query engine so you could observe the inner workings of both LlamaIndex and Ragas.\n",
        "- With Phoenix, you collected your spans and traces, imported your evaluations for easy inspection, and visualized your embedded queries and retrieved documents to identify pockets of poor performance.\n",
        "\n",
        "This notebook is just an introduction to the capabilities of Ragas and Phoenix. To learn more, see the [Ragas](https://docs.ragas.io/en/stable/) and [Phoenix docs](https://docs.arize.com/phoenix/).\n",
        "\n",
        "If you enjoyed this tutorial, please leave a ⭐ on GitHub:\n",
        "\n",
        "- [Ragas](https://github.com/explodinggradients/ragas)\n",
        "- [Phoenix](https://github.com/Arize-ai/phoenix)\n",
        "- [OpenInference](https://github.com/Arize-ai/openinference)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "rag-venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
